feat(index): thread DataFusion MemoryPool through IVF index build pipeline #7312
Draft
wjones127 wants to merge 1 commit into
…eline

Implements the pool-threading part of lance-format#7305.

- Add `make_index_memory_pool()` in `utils.rs`: reads `LANCE_INDEX_MEMORY_BUDGET` env var (bytes); returns a `GreedyMemoryPool` at that limit or an unbounded pool (existing default behavior).
- Add `memory_budget` parameter to `create_ivf_shuffler`; when set, sizes `TwoFileShuffler::batch_size_bytes` via `batch_size_from_budget(budget)` (50% of budget, floor 128 MB).
- Add `IvfIndexBuilder::with_memory_pool`; each per-partition build acquires a `MemoryReservation` before loading data. `try_grow` returning `Err` is logged as a warning (actual spill reaction is deferred to lance-format#7300).
- Wire `make_index_memory_pool` through all three IVF build entry points: `build_distributed_vector_index`, `build_vector_index`, and `build_vector_index_incremental` in `vector.rs`, and `optimize_vector_indices_v2` in `ivf.rs`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
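The pool selection and the warn-and-continue reservation step described in this commit message can be sketched with the standard library alone. This is a rough illustration, not the PR's code: `parse_budget`, `ToyPool`, and `reserve_for_partition` are hypothetical names, and `ToyPool` only stands in for DataFusion's `GreedyMemoryPool` and unbounded pool so that the example compiles without external crates. Treating an unparsable value as "no limit" is also an assumption.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Parse a byte budget from the raw value of an env var such as
/// `LANCE_INDEX_MEMORY_BUDGET`. `None` means "no limit".
/// Assumption: unparsable values are treated the same as an unset variable.
fn parse_budget(raw: Option<&str>) -> Option<usize> {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
}

/// Toy stand-in for a greedy pool: first come, first served up to `limit`
/// bytes. An unbounded pool is modelled as `limit == usize::MAX`.
struct ToyPool {
    limit: usize,
    used: AtomicUsize,
}

impl ToyPool {
    fn new(limit: Option<usize>) -> Self {
        ToyPool {
            limit: limit.unwrap_or(usize::MAX),
            used: AtomicUsize::new(0),
        }
    }

    /// Try to reserve `bytes`. On failure, `Err` carries the bytes still free.
    fn try_grow(&self, bytes: usize) -> Result<(), usize> {
        let limit = self.limit;
        self.used
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |used| {
                let new_used = used.checked_add(bytes)?;
                (new_used <= limit).then_some(new_used)
            })
            .map(|_| ())
            .map_err(|used| limit.saturating_sub(used))
    }

    /// Return `bytes` to the pool once a partition build has finished.
    fn shrink(&self, bytes: usize) {
        self.used.fetch_sub(bytes, Ordering::Relaxed);
    }
}

/// Reserve memory before loading a partition. A failed reservation is only
/// logged, mirroring the "warn and continue" behavior described above.
/// Returns `true` when the reservation was granted.
fn reserve_for_partition(pool: &ToyPool, partition_id: usize, bytes: usize) -> bool {
    match pool.try_grow(bytes) {
        Ok(()) => true,
        Err(available) => {
            eprintln!(
                "partition {partition_id}: could not reserve {bytes} bytes \
                 ({available} available); continuing without spilling"
            );
            false
        }
    }
}
```

A caller would build the pool with something like `ToyPool::new(parse_budget(std::env::var("LANCE_INDEX_MEMORY_BUDGET").ok().as_deref()))` and call `shrink` after each partition completes.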
Closes #7305. Part of epic #7301.
What
Threads a per-build `Arc<dyn MemoryPool>` through the IVF vector index build pipeline (shuffle phase + per-partition sub-index construction).

- `utils.rs`: new `make_index_memory_pool()` reads the `LANCE_INDEX_MEMORY_BUDGET` env var (bytes). Returns a `GreedyMemoryPool` at that limit, or an `UnboundedMemoryPool` when the variable is unset (no change to existing behavior).
- `shuffler.rs`: `create_ivf_shuffler` gains a `memory_budget: Option<usize>` parameter. When set, `TwoFileShuffler::batch_size_bytes` is sized as `max(budget / 2, 128 MiB)` instead of the fixed 128 MiB default.
- `builder.rs`: new `IvfIndexBuilder::with_memory_pool` builder method. Each per-partition build acquires a `MemoryReservation` before loading partition data. `try_grow` returning `Err` is the spill signal per the issue spec; for now it logs a warning and continues (actual spill reaction deferred to #7300).
- `vector.rs` / `ivf.rs`: the three IVF build entry points (`build_distributed_vector_index`, `build_vector_index`, `build_vector_index_incremental`) and `optimize_vector_indices_v2` call `make_index_memory_pool()`, pass the budget to `create_ivf_shuffler`, and call `.with_memory_pool(pool)` on every `IvfIndexBuilder`.

Testing
New unit tests cover `batch_size_from_budget` and the budget-derived batch size in `create_ivf_shuffler`. All 135 `index::vector` tests pass.
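The sizing rule these tests exercise, `max(budget / 2, 128 MiB)`, can be sketched as below. This is a minimal reconstruction from the description, not the PR's code: the constant name `DEFAULT_BATCH_SIZE_BYTES` and the helper `resolve_batch_size` are hypothetical, and the floor is assumed to be 128 MiB (the description says MiB; the commit message says MB).

```rust
/// Assumed default shuffle batch size: 128 MiB.
const DEFAULT_BATCH_SIZE_BYTES: usize = 128 * 1024 * 1024;

/// Half of the budget, but never below the 128 MiB default.
fn batch_size_from_budget(budget: usize) -> usize {
    (budget / 2).max(DEFAULT_BATCH_SIZE_BYTES)
}

/// Hypothetical helper: pick the batch size for an optional budget,
/// falling back to the fixed default when no budget is set.
fn resolve_batch_size(memory_budget: Option<usize>) -> usize {
    memory_budget
        .map(batch_size_from_budget)
        .unwrap_or(DEFAULT_BATCH_SIZE_BYTES)
}
```

Under this rule a budget of 256 MiB or less leaves the batch size at the 128 MiB default, so the budget only changes shuffle behavior once it exceeds 256 MiB.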